1 Detect the software sentences

Patterns below are written as Java string literals, so each backslash is doubled.

1.1 code snippets

Code lines are first identified by matching a strong regular expression, ".*?=.*?;$". Each detected region is then expanded to include surrounding sentences that satisfy any of the weak rules below.

-> group 1 of weak rules. A sentence that contains any of:
   "{", "}", "class", "public", "interface", "static", ");", "// ", "/*", "=="

-> group 2 of weak rules. A sentence that matches any of:
   ".*?=.*?;$", "if\\s*\\(.*", "\\(*.?\\).*?;", "while\\s\\(", "=.*?\\(", "=.*?,", "\\(.*?,$", "\\(.*?\\)$", "return.*?;", "list.*?;", "response.*?;", "\\+\\+.*?;"

1.2 system messages

A system message is identified as a run of consecutive sentences containing ":" or "_". A sentence counts as part of a system message if it matches either of:
   "[a-zA-Z0-9]+[\\s\\pP]*:[\\s\\pP]*[a-zA-Z0-9]+", "\\w+[_]\\w+"

2 Before the above process

A line that ends with "." or "?" is usually not a software sentence, so such lines are skipped before any of the rules above are applied.

    // Returns true if the line looks like an ordinary natural-language
    // sentence, i.e., it ends with "." or "?".
    private boolean isSent(String line) {
        String[] ends = {".", "?"};
        for (String end : ends) {
            if (line.endsWith(end)) {
                return true;
            }
        }
        return false;
    }

3 After the above process

If two detected regions are within three sentences of each other, the sentences between them are also treated as software language.
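
The pre-filter, the strong rule, and the gap-filling step can be sketched together as follows. This is only an illustrative sketch, not the tool's actual implementation: the class name SoftwareSentenceSketch and the helpers markSeeds and fillGaps are hypothetical, the weak-rule expansion is left out, and "within three sentences" is assumed to mean that at most three sentences lie between two detected sentences.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: names and structure are assumptions, not the original tool.
class SoftwareSentenceSketch {
    // Strong rule from section 1.1: an assignment-like line ending with ';'.
    private static final Pattern STRONG = Pattern.compile(".*?=.*?;$");

    // Section 2 pre-filter: lines ending with "." or "?" are treated as natural language.
    static boolean isSent(String line) {
        return line.endsWith(".") || line.endsWith("?");
    }

    // Flags each sentence that passes the pre-filter and matches the strong rule.
    static boolean[] markSeeds(List<String> sentences) {
        boolean[] flags = new boolean[sentences.size()];
        for (int i = 0; i < flags.length; i++) {
            String s = sentences.get(i).trim();
            flags[i] = !isSent(s) && STRONG.matcher(s).matches();
        }
        return flags;
    }

    // Section 3: if at most maxGap unflagged sentences separate two flagged ones
    // (assumed reading of "within three sentences"), flag the sentences between them.
    static boolean[] fillGaps(boolean[] flags, int maxGap) {
        boolean[] out = flags.clone();
        int last = -1; // index of the most recent flagged sentence
        for (int i = 0; i < flags.length; i++) {
            if (!flags[i]) {
                continue;
            }
            if (last >= 0 && i - last - 1 <= maxGap) {
                for (int j = last + 1; j < i; j++) {
                    out[j] = true;
                }
            }
            last = i;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
            "We tried this.", "int x = 1;", "then", "also", "y = foo(x);", "Does it work?");
        boolean[] merged = fillGaps(markSeeds(sentences), 3);
        // prints [false, true, true, true, true, false]
        System.out.println(Arrays.toString(merged));
    }
}
```

In this example the two seed lines are separated by two sentences, so both intermediate sentences are absorbed into the software region, while the first and last lines are excluded by the pre-filter.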